A Protein Secondary Structure Prediction Method Based on Stochastic Tree Grammars
Abstract
We propose a new method for predicting the protein secondary structure of a given amino acid sequence, based on a classification rule automatically learned from a relatively small database of sequences whose secondary structures have been experimentally determined. Among the three types of secondary structure, α-helix, β-sheet, and others, we concentrate on the problem of predicting the β-sheet regions in a given amino acid sequence. Our method receives as input amino acid sequences known to contain β-sheet regions, and trains the probability parameters of a certain type of stochastic tree grammar so that its distribution best approximates the patterns of the input sample. Some of the rules in the grammar are intended a priori for generating β-sheet regions and others for non-β-sheet regions. After training, the method is given a sequence of amino acids with unknown secondary structure, and predicts as β-sheet those regions that are generated by the β-sheet rules in the most likely parse of the input sequence.

The problem of predicting protein structures from their amino acid sequences is probably the single most important problem in genetic information processing, with immense scientific significance and broad engineering applications. The prediction of protein secondary structure is considered to be an important step towards this goal. The problem of determining which regions in a given amino acid sequence correspond to each of the three categories is the classical secondary structure prediction problem, and it has been attempted by many researchers using various techniques (e.g. [RS93]). There have been several attempts at the prediction of α-helix regions using machine learning approaches, with moderate success. The problem of predicting β-sheet regions, however, has not been treated at a comparable level.
This asymmetry can be attributed to the property of β-sheets that their structures typically range over several discontinuous sections in an amino acid sequence, whereas the structures of α-helices are continuous and their dependency patterns are more regular. To cope with this difficulty, we use a certain family of stochastic tree grammars whose expressive power exceeds not only that of Hidden Markov Models (HMMs), but also that of stochastic context-free grammars (SCFGs). Context-free grammars are not powerful enough to capture the kind of long-distance dependencies exhibited by the amino acid sequences of β-sheet regions. This is because the β-sheet regions exhibit both the ‘antiparallel’ dependency (of the type ‘abccba’) and the ‘parallel’ dependency (of the type ‘abcabc’), and moreover, various combinations of them (as, for example, in ‘abccbaabccba’). The class of stochastic tree grammars which we employ in this paper, the Stochastic Ranked Node Rewriting Grammar (SRNRG), is one of the rare families of grammatical systems that have enough expressive power to cope with all of these dependencies and at the same time enjoy efficient parsability and learnability (in some sense). The Ranked Node Rewriting Grammars (RNRG) were briefly introduced in the context of computationally efficient learnability of grammars by Abe [Abe88], but their formal properties, as well as basic methods such as parsing and learning algorithms, were left for future research. The discovery of RNRG was inspired by the pioneering work of Joshi et al. (see for example [VSJ85]) on a tree grammatical system for natural language called ‘Tree Adjoining Grammars’ (or TAG for short), but RNRG generalizes TAG just in a way that ...

This PostScript version was created from the original authors’ English article by the Japanese Information Sciences Project (JISP) at New York University, in collaboration with the RWCP, aiming at worldwide access to the information.
Every precaution has been taken to avoid errors arising from the conversion of printed documents to electronic form; however, should there be any discrepancies, the JISP bears sole responsibility for them. Email address: [email protected], URL http://jisp.cs.nyu.edu/.

1 Real World Computing Partnership
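The ‘antiparallel’ (‘abccba’) and ‘parallel’ (‘abcabc’) dependency types mentioned in the text can be made concrete with a small sketch. The code below is an illustration only, not the authors' SRNRG parser: all function names are hypothetical, and it assumes the simplest reading of the two patterns, namely a string w followed by its reverse (antiparallel) or by a copy of itself (parallel). It pairs up the dependent positions and checks whether any two pairs cross. Nested pairs are the kind a context-free grammar can generate, while crossing pairs correspond to the copy language {ww}, which is known not to be context-free.

```python
from itertools import combinations


def is_antiparallel(s: str) -> bool:
    """True if s has the form w + reverse(w), e.g. 'abccba'."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and half > 0 and s[:half] == s[half:][::-1]


def is_parallel(s: str) -> bool:
    """True if s has the form w + w, e.g. 'abcabc'."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and half > 0 and s[:half] == s[half:]


def dependency_pairs(s: str, kind: str) -> list[tuple[int, int]]:
    """Index pairs (i, j), i < j, of positions that depend on each other."""
    half = len(s) // 2
    if kind == "antiparallel":
        # Position i is matched with its mirror image from the end.
        return [(i, len(s) - 1 - i) for i in range(half)]
    if kind == "parallel":
        # Position i is matched with the same offset in the second half.
        return [(i, half + i) for i in range(half)]
    raise ValueError(f"unknown dependency kind: {kind}")


def crosses(p: tuple[int, int], q: tuple[int, int]) -> bool:
    """True if the two pairs interleave (a < c < b < d) rather than nest."""
    (a, b), (c, d) = sorted([p, q])
    return a < c < b < d


def has_crossing(pairs: list[tuple[int, int]]) -> bool:
    """True if any two dependency pairs cross each other."""
    return any(crosses(p, q) for p, q in combinations(pairs, 2))


if __name__ == "__main__":
    anti = dependency_pairs("abccba", "antiparallel")
    para = dependency_pairs("abcabc", "parallel")
    print("antiparallel pairs:", anti, "crossing:", has_crossing(anti))
    print("parallel pairs:", para, "crossing:", has_crossing(para))
```

Under these assumptions, the combined example ‘abccbaabccba’ is a parallel copy of an antiparallel block: `is_parallel("abccbaabccba")` and `is_antiparallel("abccba")` both hold, so a grammar for such sequences needs to handle nested and crossing dependencies together.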
Related resources
PreRkTAG: Prediction of RNA Knotted Structures Using Tree Adjoining Grammars
Background: RNA molecules play many important regulatory, catalytic and structural ...
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history
MOTIVATION: Many computerized methods for RNA secondary structure prediction have been developed. Few of these methods, however, employ an evolutionary model, so relevant information is often left out of the structure determination. This paper introduces a method which incorporates evolutionary history into RNA secondary structure prediction. The method reported here is based on stochastic c...
Predicting Location and Structure Of beta-Sheet Regions Using Stochastic Tree Grammars
We describe and demonstrate the effectiveness of a method of predicting protein secondary structures, beta-sheet regions in particular, using a class of stochastic tree grammars as representational language for their amino acid sequence patterns. The family of stochastic tree grammars we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the rare families of stochastic gramma...
Pairwise RNA Pseudoknotted Structure Prediction Based on Stochastic Grammar
RNA secondary structure prediction is one of the major topics in bioinformatics. A prediction method based on a parsing algorithm for formal grammars is a promising approach. Also, it is expected that comparative sequence analysis achieves higher accuracy than the one using a single sequence since the former approach can use evolutionary information that homologous RNAs are likely to conserve a...
Stochastic Context-Free Grammars and RNA Secondary Structure Prediction
This thesis focuses on the prediction of RNA secondary structure using stochastic context-free grammars (SCFG). The RNA secondary structure prediction problem consists of predicting a 2-dimensional structure from a 1-dimensional nucleotide sequence. The theory behind SCFG is explained and an overview of the research literature on various methods in the field of secondary structure prediction is g...
Publication date: 1994